
    Data Curation in Interdisciplinary and Highly Collaborative Research

    This paper provides a systematic analysis of publications that discuss data curation in interdisciplinary and highly collaborative research (IHCR). Using content analysis methodology, the study examined 159 publications and identified patterns in definitions of interdisciplinarity, in projects’ participants and methodologies, and in approaches to data curation. The findings suggest that data is a prominent component of interdisciplinarity. In addition to crossing disciplinary and other boundaries, IHCR is characterized by curating and integrating heterogeneous data and creating new forms of knowledge from it. Drawing on personal experiences and descriptive approaches, the publications discussed the challenges that data curation in IHCR faces, including increased overhead in coordination and management, a lack of consistent metadata practices, and custom infrastructure that makes interoperability across projects, domains, and repositories difficult. The paper concludes with suggestions for future research.

    Mental disorders over time: a dictionary-based approach to the analysis of knowledge domains

    Every decade brings changes in the perceptions of what is normal in mental health, as well as in how the abnormal is labeled, understood, and dealt with. Neurosis, hysteria, and homosexuality are just a few examples of such changes. The shifts in terminology and classifications reflect our continuous struggle with social representations and treatment of the “other.” How could we best understand mental illness categorizations and become aware of their changes over time? In this paper, we seek to address this and other questions by applying an automated dictionary-based classification approach to the analysis of relevant research literature over time. We propose to examine the domain of mental health literature with an iterative workflow that combines large-scale data, an automated classifier, and visual analytics. We report on the early results of our analysis and discuss the challenges and opportunities of using the workflow in domain analysis over time.
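
    As an illustration of the dictionary-based approach described above, the following minimal Python sketch assigns documents to knowledge domains by matching tokens against curated term lists and aggregates the results by decade. The dictionaries, example documents, and function names are our own invention for illustration; they are not taken from the paper.

    # Minimal sketch of a dictionary-based domain classifier: documents are
    # assigned to knowledge domains by counting matches against curated term
    # dictionaries, then aggregated by decade. Dictionaries and documents
    # below are invented for illustration.
    from collections import Counter
    from typing import Optional

    DICTIONARIES = {
        "neurosis": {"neurosis", "neurotic", "psychoneurosis"},
        "hysteria": {"hysteria", "hysterical"},
    }

    def classify(text: str) -> Optional[str]:
        """Return the domain whose dictionary matches the most tokens, or None."""
        tokens = text.lower().split()
        scores = {domain: sum(t in terms for t in tokens)
                  for domain, terms in DICTIONARIES.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else None

    def domains_by_decade(corpus):
        """corpus: iterable of (year, text) pairs -> {decade: Counter of domains}."""
        trends = {}
        for year, text in corpus:
            domain = classify(text)
            if domain:
                trends.setdefault(year // 10 * 10, Counter())[domain] += 1
        return trends

    if __name__ == "__main__":
        corpus = [(1955, "A study of neurosis and neurotic symptoms"),
                  (1957, "Hysterical reactions in clinical practice"),
                  (1985, "Revisiting neurosis after DSM-III")]
        print(domains_by_decade(corpus))
        # {1950: Counter({'neurosis': 1, 'hysteria': 1}), 1980: Counter({'neurosis': 1})}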

    SEAD Virtual Archive: Thin Layer for Scientific Discovery and Long-Term Preservation

    Major research universities are grappling with their response to the deluge of scientific data in its big data and long-tail forms. The latter consists of many diverse and heterogeneous datasets that are collected via specialized methods and stored in a variety of formats and places. University libraries and their institutional repositories have traditionally been able to handle scientific output, but long-tail scientific data introduce substantial challenges for a traditional document-based repository through their vast heterogeneity and size and their demands for meaningful discovery and, in the case of large datasets, place-based use. In this presentation we will provide a brief overview of the NSF-funded project "Sustainable Environment - Actionable Data" (SEAD), which addresses the challenges of long-tail scientific data with a focus on sustainability science. We will give an overview of the project and of its discovery and preservation component, the SEAD Virtual Archive, which is being developed by the Data to Insight Center team at Indiana University in collaboration with the IU and UIUC libraries. We will describe the main features of and our ongoing work on the SEAD Virtual Archive and discuss the value and importance of partnerships between data research centers, such as D2I, and libraries.

    Data curators at work: Focus on projects and experiences

    Editor's Summary: Three postdoctoral fellows in a program sponsored by the Council on Library and Information Resources/Digital Library Federation are exploring and contributing to the field of digital curation from very different perspectives. With a neuroscience background, Katherine Akers is encouraging scientists to preserve and share research datasets and analyzing the use of library resources. For Inna Kouper, building cyberinfrastructure and facilitating and promoting user engagement are primary. Matthew Lavin is working to make digital tools and approaches serve the needs of humanists, focusing on digitally conveying the physical features and histories of books. With different definitions of data and a variety of research goals, the scholars apply hybrid professional approaches to digital curation, stimulating expanded information, intellectual cross-fertilization, and a broader view of data, research, and knowledge.
    Peer Reviewed
    http://deepblue.lib.umich.edu/bitstream/2027.42/102231/1/1720400113_ftp.pd

    Repository of NSF Funded Publications and Data Sets: "Back of Envelope" 15 year Cost Estimate

    In this back-of-envelope study we calculate the 15-year fixed and variable costs of setting up and running a data repository (or database) to store and serve the publications and datasets derived from research funded by the National Science Foundation (NSF). Costs are computed on a yearly basis using a fixed estimate of the number of papers that are published each year that list NSF as their funding agency. We assume each paper has one dataset and estimate the size of that dataset based on experience. By our estimates, the number of papers generated each year is 64,340. The average dataset size over all seven directorates of NSF is 32 gigabytes (GB). The total amount of data added to the repository is two petabytes (PB) per year, or 30 PB over 15 years. The architecture of the data/paper repository is based on a hierarchical storage model that uses a combination of fast disk for rapid access and tape for high reliability and cost-efficient long-term storage. Data are ingested through workflows that are used in university institutional repositories, which add metadata and ensure data integrity. Average fixed costs are approximately $0.90/GB over the 15-year span. Variable costs are estimated at a sliding scale of $150–$100 per new dataset for up-front curation, or $4.87–$3.22 per GB. Variable costs reflect a 3% annual decrease in curation costs, as efficiency and automated metadata and provenance capture are anticipated to help reduce what are now largely manual curation efforts. The total projected cost of the data and paper repository is estimated at $167,000,000 over 15 years of operation, curating close to one million datasets and one million papers. After 15 years and 30 PB of data accumulated and curated, we estimate the cost per gigabyte at $5.56. This $167 million cost is a direct cost in that it does not include federally allowable indirect cost return (ICR). After 15 years, it is reasonable to assume that some datasets will be compressed and rarely accessed. Others may be deemed no longer valuable, e.g., because they are replaced by more accurate results. Therefore, at some point the data growth in the repository will need to be adjusted through strategic preservation.

    Repository of NSF-funded Publications and Related Datasets: “Back of Envelope” Cost Estimate for 15 years

    In this back-of-envelope study we calculate the 15-year fixed and variable costs of setting up and running a data repository (or database) to store and serve the publications and datasets derived from research funded by the National Science Foundation (NSF). Costs are computed on a yearly basis using a fixed estimate of the number of papers that are published each year that list NSF as their funding agency. We assume each paper has one dataset and estimate the size of that dataset based on experience. By our estimates, the number of papers generated each year is 64,340. The average dataset size over all seven directorates of NSF is 32 gigabytes (GB). The total amount of data added to the repository is two petabytes (PB) per year, or 30 PB over 15 years. The architecture of the data/paper repository is based on a hierarchical storage model that uses a combination of fast disk for rapid access and tape for high reliability and cost-efficient long-term storage. Data are ingested through workflows that are used in university institutional repositories, which add metadata and ensure data integrity. Average fixed costs are approximately $0.90/GB over the 15-year span. Variable costs are estimated at a sliding scale of $150–$100 per new dataset for up-front curation, or $4.87–$3.22 per GB. Variable costs reflect a 3% annual decrease in curation costs, as efficiency and automated metadata and provenance capture are anticipated to help reduce what are now largely manual curation efforts. The total projected cost of the data and paper repository is estimated at $167,000,000 over 15 years of operation, curating close to one million datasets and one million papers. After 15 years and 30 PB of data accumulated and curated, we estimate the cost per gigabyte at $5.56. This $167 million cost is a direct cost in that it does not include federally allowable indirect cost return (ICR). After 15 years, it is reasonable to assume that some datasets will be compressed and rarely accessed. Others may be deemed no longer valuable, e.g., because they are replaced by more accurate results. Therefore, at some point the data growth in the repository will need to be adjusted through strategic preservation.
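
    The headline figures in the two cost-estimate abstracts above can be roughly re-derived from the stated inputs. The following Python sketch, using our own variable names, checks the yearly and 15-year data volumes, the per-dataset curation cost decline, and the cost per gigabyte; it is a back-of-envelope check, not the paper's full cost model.

    # Back-of-envelope check of the figures quoted in the abstracts above.
    # Only quantities stated there are used; the paper's full cost model
    # includes components not reproduced here.
    PAPERS_PER_YEAR = 64_340      # papers listing NSF funding, per year
    GB_PER_DATASET = 32           # average dataset size across directorates
    YEARS = 15
    GB_PER_PB = 1_000_000

    gb_per_year = PAPERS_PER_YEAR * GB_PER_DATASET
    total_gb = gb_per_year * YEARS
    print(f"Data per year: {gb_per_year / GB_PER_PB:.2f} PB")         # ~2.06 PB ("two petabytes")
    print(f"Data after {YEARS} years: {total_gb / GB_PER_PB:.1f} PB")  # ~30.9 PB ("30 PB")

    # Up-front curation starts at $150 per dataset and falls 3% per year,
    # ending near the quoted $100 per dataset.
    year15_cost = 150.0 * (1 - 0.03) ** (YEARS - 1)
    print(f"Curation cost per dataset in year 15: ${year15_cost:.0f}")  # ~$98

    # The quoted $167M total over ~30 PB gives the quoted per-GB cost.
    print(f"Cost per GB: ${167_000_000 / (30 * GB_PER_PB):.2f}")  # $5.57, matching the quoted $5.56 up to rounding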

    SEAD Virtual Archive: Building a Federation of Institutional Repositories for Long Term Data Preservation

    Major research universities are grappling with their response to the deluge of scientific data emerging from research by their faculty. Many are looking to their libraries and institutional repositories for a solution. Scientific data introduce substantial challenges that the document-based institutional repository may not be suited to handle. The Sustainable Environment - Actionable Data (SEAD) Virtual Archive specifically addresses the challenges of “long tail” scientific data. In this paper, we propose requirements, policy, and architecture to support not only the preservation of scientific data today using institutional repositories, but also rich access to and use of those data into the future.

    Building Tools to Support Active Curation: Lessons Learned from SEAD

    SEAD, a project funded by the US National Science Foundation’s DataNet program, has spent the last five years designing, building, and deploying an integrated set of services to better connect scientists’ research workflows to data publication and preservation activities. Throughout the project, SEAD has promoted the concept and practice of “active curation,” which consists of capturing data and metadata early and refining them throughout the data life cycle. In promoting active curation, our team saw an opportunity to develop tools that would help scientists better manage data for their own use, improve team coordination around data, implement practices that would serve the data better over time, and seamlessly connect with data repositories to ease the burden of sharing and publishing. SEAD has worked with 30 projects, dozens of researchers, and hundreds of thousands of files, providing us with ample opportunities to learn about data and metadata, integrate with researchers’ workflows, and build tools and services for data. In this paper, we discuss the lessons we have learned and suggest how they might guide future data infrastructure development efforts.
    National Science Foundation #OCI0940824
    Peer Reviewed
    https://deepblue.lib.umich.edu/bitstream/2027.42/140714/1/document.pdf
    Description of document.pdf: Main Article
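
    The “active curation” practice described above, capturing metadata when data first enters a project workspace and refining it across the data life cycle, can be sketched as follows. This is a minimal illustration under our own assumptions; the class and method names are invented and are not SEAD's actual API.

    # Minimal sketch of "active curation": capture technical metadata at
    # ingest, let researchers refine it later, and check completeness before
    # publication. Names and fields are invented; this is not SEAD's API.
    import hashlib
    import time
    from dataclasses import dataclass, field
    from pathlib import Path

    @dataclass
    class CuratedItem:
        path: Path
        sha256: str          # integrity fingerprint captured at ingest
        added_at: float      # capture time (seconds since epoch)
        metadata: dict = field(default_factory=dict)

    class Workspace:
        def __init__(self):
            self.items = {}

        def add(self, path: Path, **metadata) -> CuratedItem:
            """Capture data and technical metadata as soon as a file arrives."""
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            item = CuratedItem(path, digest, time.time(), dict(metadata))
            self.items[str(path)] = item
            return item

        def refine(self, path: Path, **metadata) -> None:
            """Enrich or correct metadata later in the data life cycle."""
            self.items[str(path)].metadata.update(metadata)

        def ready_to_publish(self, path: Path, required=("title", "creator")) -> bool:
            """Simple completeness check before handing off to a repository."""
            meta = self.items[str(path)].metadata
            return all(key in meta for key in required)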

    SEAD Virtual Archive: Building a Federation of Institutional Repositories for Long-Term Data Preservation in Sustainability Science

    Major research universities are grappling with their response to the deluge of scientific data emerging from research by their faculty. Many are looking to their libraries and institutional repositories for a solution. Scientific data introduce substantial challenges that the document-based institutional repository may not be suited to handle. The Sustainable Environment - Actionable Data (SEAD) Virtual Archive (VA) specifically addresses the challenges of ‘long tail’ scientific data. In this paper, we propose requirements, policy, and architecture to support not only the preservation of scientific data today using institutional repositories, but also rich access to data and their use into the future.